Airbnb Copenhagen
Introduction
Copenhagen, the happiest city in the world, Who doesn’t want to go there! Happiness comes at a price though, you’ll need to find a place to stay. What better way to look for a place to stay than on AirBnB and us being students, would like to know what this will cost us. This is where the MAM program from LBS is already paying itself off. To find out what variables (e.g. room type, property type) influence the price most so we can predict the cost with a regression analysis. So our dream holiday can become reality.
The data we will be using for this comes directly from AirBnB. In this analysis we will only be using data on the city of Copenhagen. the data shows the listings as per June 26th 2020.
This analysis is divided into 4 parts:
- Executive summary
- Exploratory data analysis
- Regression analysis
- Conclusion
Executive summary
Using AirBnB data for listings at the time of June 26th 2020, we predicted the costs of a stay in Copenhagen. The requirements were 4 nights with 2 persons, a private room and a minimum review score of 90 with at least 10 reviews. Our method was to run and exploratory data analysis to set the scene for the data that we would use to make our prediction. Once we had confirmed the important values that could have a significant influence on the price. we identified the following variables:
- property type
- number of reviews
- review score
- room type
- bathrooms
- bedrooms
- accommodates
- neighborhood
- cancellation policy
- Superhost
After running regression models with these variables we came to a model that had the most prediction power, model 7. With the specified criteria we made a prediction with 95% confidence that the cost for a median listing on Airbnb in Copenhagen is between 1,860.18 DKK and 1,938.58 DKK.
Exploratory data analysis
To get an idea of the data structure, an exploratory data analysis is conducted. The first steps are to look at the raw values and to clean the data afterwards. This gives a foundation to use in summary statistics. Finally, visualisations will give an impression of any correlations in variables and will give an outlook on variables that stand out.
Looking at the raw values
First of all, the data needs to be loaded. The clean_names() function is used to make the values consistent and neat, to help with the data wrangling.
listings <- vroom("http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2020-06-26/data/listings.csv.gz") %>%
# Get the listings for Copenhagen
clean_names()
## Rows: 28,523
## Columns: 106
## Delimiter: ","
## chr [45]: listing_url, name, summary, space, description, experiences_offered, neighborho...
## dbl [40]: id, scrape_id, host_id, host_listings_count, host_total_listings_count, zipcode...
## lgl [16]: thumbnail_url, medium_url, xl_picture_url, host_is_superhost, host_has_profile_...
## date [ 5]: last_scraped, host_since, calendar_last_scraped, first_review, last_review
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
Having looked at the data, the following variables will be of importance throughout this analysis:
price : cost per night
cleaning_fee: cleaning fee
extra_people: charge for having more than 1 person
property_type: type of accommodation (House, Apartment, etc.)
room_type: Entire home/apt (guests have entire place to themselves) Private room (Guests have private room to sleep, all other rooms shared) Shared room (Guests sleep in room shared with others) number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
longitude , latitude: geographical coordinates to help us locate the listing
neighborhood*: three variables on a few major neighborhoods in each city
The next step is to check the data for the properties of the variables and if there are any missing or NA values that need to be addressed.
## Rows: 28,523
## Columns: 106
## $ id <dbl> 6983, 26057, 26473, 29...
## $ listing_url <chr> "https://www.airbnb.co...
## $ scrape_id <dbl> 2.020063e+13, 2.020063...
## $ last_scraped <date> 2020-06-28, 2020-06-2...
## $ name <chr> "Copenhagen 'N Livin'"...
## $ summary <chr> "Lovely apartment loca...
## $ space <chr> "Beautiful and cosy ap...
## $ description <chr> "Lovely apartment loca...
## $ experiences_offered <chr> "none", "none", "none"...
## $ neighborhood_overview <chr> "Nice bars and cozy ca...
## $ notes <chr> NA, NA, NA, NA, "Pleas...
## $ transit <chr> "Bus 66 runs to the ce...
## $ access <chr> "Bedroom, living room,...
## $ interaction <chr> "We are usually at wor...
## $ house_rules <chr> "No smoking allowed! N...
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url <chr> "https://a0.muscache.c...
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id <dbl> 16774, 109777, 112210,...
## $ host_url <chr> "https://www.airbnb.co...
## $ host_name <chr> "Simon", "Kari", "Oliv...
## $ host_since <date> 2009-05-12, 2010-04-1...
## $ host_location <chr> "Copenhagen, Capital R...
## $ host_about <chr> "I'm currently working...
## $ host_response_time <chr> "N/A", "N/A", "within ...
## $ host_response_rate <chr> "N/A", "N/A", "100%", ...
## $ host_acceptance_rate <chr> "33%", "19%", "100%", ...
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url <chr> "https://a0.muscache.c...
## $ host_picture_url <chr> "https://a0.muscache.c...
## $ host_neighbourhood <chr> "Nørrebro", "Indre By"...
## $ host_listings_count <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_total_listings_count <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_verifications <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified <lgl> FALSE, FALSE, TRUE, FA...
## $ street <chr> "Copenhagen, Hovedstad...
## $ neighbourhood <chr> "Nørrebro", "Indre By"...
## $ neighbourhood_cleansed <chr> "Nrrebro", "Indre By",...
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA...
## $ city <chr> "Copenhagen", "Copenha...
## $ state <chr> "Hovedstaden", "Hoveds...
## $ zipcode <dbl> 2200, 2100, 1210, 1650...
## $ market <chr> "Copenhagen", "Copenha...
## $ smart_location <chr> "Copenhagen, Denmark",...
## $ country_code <chr> "DK", "DK", "DK", "DK"...
## $ country <chr> "Denmark", "Denmark", ...
## $ latitude <dbl> 55.68798, 55.69163, 55...
## $ longitude <dbl> 12.54571, 12.57459, 12...
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE...
## $ property_type <chr> "Apartment", "House", ...
## $ room_type <chr> "Private room", "Entir...
## $ accommodates <dbl> 2, 6, 12, 2, 4, 3, 3, ...
## $ bathrooms <dbl> 1.0, 1.5, 2.5, 1.0, 1....
## $ bedrooms <dbl> 1, 4, 6, 1, 3, 1, 1, 2...
## $ beds <dbl> 1, 4, 7, 1, 3, 3, 2, 2...
## $ bed_type <chr> "Real Bed", "Real Bed"...
## $ amenities <chr> "{TV,\"Cable TV\",Wifi...
## $ square_feet <dbl> 97, NA, NA, NA, NA, 68...
## $ price <chr> "$365.00", "$2,398.00"...
## $ weekly_price <chr> NA, NA, "$17,513.00", ...
## $ monthly_price <chr> NA, NA, "$67,073.00", ...
## $ security_deposit <chr> "$0.00", "$5,000.00", ...
## $ cleaning_fee <chr> "$33.00", "$1,100.00",...
## $ guests_included <dbl> 1, 3, 1, 1, 1, 2, 2, 2...
## $ extra_people <chr> "$66.00", "$350.00", "...
## $ minimum_nights <dbl> 2, 3, 3, 7, 7, 2, 3, 6...
## $ maximum_nights <dbl> 15, 30, 31, 14, 31, 10...
## $ minimum_minimum_nights <dbl> 2, 3, 3, 3, 7, 2, 3, 6...
## $ maximum_minimum_nights <dbl> 2, 3, 3, 5, 7, 2, 3, 6...
## $ minimum_maximum_nights <dbl> 15, 30, 1125, 14, 1125...
## $ maximum_maximum_nights <dbl> 15, 30, 1125, 14, 1125...
## $ minimum_nights_avg_ntm <dbl> 2.0, 3.0, 3.0, 4.1, 7....
## $ maximum_nights_avg_ntm <dbl> 15, 30, 1125, 14, 1125...
## $ calendar_updated <chr> "5 months ago", "4 mon...
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30 <dbl> 29, 28, 29, 21, 0, 0, ...
## $ availability_60 <dbl> 59, 58, 59, 21, 0, 0, ...
## $ availability_90 <dbl> 89, 88, 89, 21, 0, 0, ...
## $ availability_365 <dbl> 89, 363, 172, 21, 0, 5...
## $ calendar_last_scraped <date> 2020-06-28, 2020-06-2...
## $ number_of_reviews <dbl> 168, 50, 293, 22, 90, ...
## $ number_of_reviews_ltm <dbl> 1, 4, 31, 2, 0, 0, 1, ...
## $ first_review <date> 2009-09-04, 2013-12-0...
## $ last_review <date> 2019-07-19, 2019-12-1...
## $ review_scores_rating <dbl> 96, 98, 91, 98, 94, 97...
## $ review_scores_accuracy <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_cleanliness <dbl> 9, 10, 9, 10, 9, 10, 1...
## $ review_scores_checkin <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 10,...
## $ review_scores_location <dbl> 9, 10, 10, 10, 10, 10,...
## $ review_scores_value <dbl> 9, 10, 9, 10, 9, 9, 9,...
## $ requires_license <lgl> FALSE, FALSE, FALSE, F...
## $ license <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, F...
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy <chr> "moderate", "moderate"...
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, F...
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_entire_homes <dbl> 0, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0...
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month <dbl> 1.28, 0.62, 2.48, 0.18...
| Name | listings |
| Number of rows | 28523 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 45 |
| Date | 5 |
| logical | 16 |
| numeric | 40 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 28523 | 0 |
| name | 57 | 1.00 | 1 | 211 | 0 | 26907 | 0 |
| summary | 1096 | 0.96 | 1 | 1000 | 0 | 26986 | 0 |
| space | 11390 | 0.60 | 1 | 1000 | 0 | 16806 | 0 |
| description | 515 | 0.98 | 1 | 1000 | 0 | 27748 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 12407 | 0.57 | 1 | 1000 | 0 | 15520 | 0 |
| notes | 20832 | 0.27 | 1 | 1000 | 0 | 7366 | 0 |
| transit | 11316 | 0.60 | 1 | 1000 | 0 | 16714 | 0 |
| access | 15751 | 0.45 | 1 | 1000 | 0 | 11251 | 0 |
| interaction | 14111 | 0.51 | 1 | 1000 | 0 | 13728 | 0 |
| house_rules | 13332 | 0.53 | 1 | 1000 | 0 | 13703 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 28282 | 0 |
| host_url | 0 | 1.00 | 37 | 43 | 0 | 25745 | 0 |
| host_name | 12 | 1.00 | 1 | 34 | 0 | 6415 | 0 |
| host_location | 93 | 1.00 | 2 | 152 | 0 | 863 | 0 |
| host_about | 14028 | 0.51 | 1 | 3550 | 0 | 12478 | 34 |
| host_response_time | 11 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 11 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 11 | 1.00 | 2 | 4 | 0 | 100 | 0 |
| host_thumbnail_url | 11 | 1.00 | 55 | 106 | 0 | 25669 | 0 |
| host_picture_url | 11 | 1.00 | 57 | 109 | 0 | 25669 | 0 |
| host_neighbourhood | 8007 | 0.72 | 1 | 21 | 0 | 56 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 316 | 0 |
| street | 0 | 1.00 | 10 | 61 | 0 | 528 | 0 |
| neighbourhood | 0 | 1.00 | 5 | 14 | 0 | 21 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 25 | 0 | 11 | 0 |
| city | 13 | 1.00 | 1 | 26 | 0 | 136 | 0 |
| state | 24042 | 0.16 | 1 | 25 | 0 | 170 | 0 |
| market | 864 | 0.97 | 6 | 21 | 0 | 10 | 0 |
| smart_location | 0 | 1.00 | 10 | 35 | 0 | 158 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 29 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 26634 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 611 | 0 |
| weekly_price | 25008 | 0.12 | 7 | 11 | 0 | 718 | 0 |
| monthly_price | 26971 | 0.05 | 7 | 11 | 0 | 476 | 0 |
| security_deposit | 13845 | 0.51 | 5 | 10 | 0 | 386 | 0 |
| cleaning_fee | 8968 | 0.69 | 5 | 9 | 0 | 430 | 0 |
| extra_people | 0 | 1.00 | 5 | 9 | 0 | 265 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 82 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| host_since | 11 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-06-22 | 3379 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| first_review | 4968 | 0.83 | 2009-09-04 | 2020-06-28 | 2017-08-07 | 2582 |
| last_review | 4968 | 0.83 | 2011-08-10 | 2020-06-28 | 2019-07-29 | 1846 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 28523 | 0 | NaN | : |
| medium_url | 28523 | 0 | NaN | : |
| xl_picture_url | 28523 | 0 | NaN | : |
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| host_has_profile_pic | 11 | 1 | 1.00 | TRU: 28445, FAL: 67 |
| host_identity_verified | 11 | 1 | 0.38 | FAL: 17790, TRU: 10722 |
| neighbourhood_group_cleansed | 28523 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
| has_availability | 0 | 1 | 1.00 | TRU: 28523 |
| requires_license | 0 | 1 | 0.00 | FAL: 28523 |
| license | 28523 | 0 | NaN | : |
| jurisdiction_names | 28523 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.29 | FAL: 20360, TRU: 8163 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 28523 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 28379, TRU: 144 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 28327, TRU: 196 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.044063e+07 | 12261143.29 | 6.983000e+03 | 1.008379e+07 | 1.950297e+07 | 3.043587e+07 | 4.395635e+07 | <U+2587><U+2587><U+2587><U+2585><U+2586> |
| scrape_id | 0 | 1.00 | 2.020063e+13 | 0.00 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 6.731400e+07 | 76172316.22 | 5.130000e+02 | 1.281774e+07 | 3.604205e+07 | 9.463162e+07 | 3.519509e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 11 | 1.00 | 5.460000e+00 | 34.32 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 7.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 11 | 1.00 | 5.460000e+00 | 34.32 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 7.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 815 | 0.97 | 2.069760e+03 | 390.81 | 2.000000e+01 | 1.855750e+03 | 2.200000e+03 | 2.300000e+03 | 2.100000e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.568000e+01 | 0.02 | 5.562000e+01 | 5.567000e+01 | 5.568000e+01 | 5.570000e+01 | 5.573000e+01 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 1.256000e+01 | 0.03 | 1.245000e+01 | 1.254000e+01 | 1.255000e+01 | 1.258000e+01 | 1.264000e+01 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.320000e+00 | 1.63 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 12 | 1.00 | 1.080000e+00 | 0.28 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.550000e+00 | 1.06 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.010000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.040000e+00 | 1.44 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 2.500000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 28130 | 0.01 | 7.218900e+02 | 576.10 | 0.000000e+00 | 1.200000e+02 | 7.640000e+02 | 1.076000e+03 | 2.799000e+03 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.520000e+00 | 1.06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 3.850000e+00 | 18.07 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 6.206700e+02 | 553.19 | 1.000000e+00 | 1.500000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 3.840000e+00 | 18.08 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 4.120000e+00 | 19.29 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 6.669200e+02 | 548.00 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 6.703800e+02 | 547.39 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 3.970000e+00 | 18.35 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 6.684100e+02 | 547.20 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 5.870000e+00 | 10.42 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 8.000000e+00 | 3.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_60 | 0 | 1.00 | 1.124000e+01 | 20.15 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.400000e+01 | 6.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_90 | 0 | 1.00 | 1.649000e+01 | 29.99 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.800000e+01 | 9.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_365 | 0 | 1.00 | 4.947000e+01 | 99.23 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.500000e+01 | 3.650000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 1.357000e+01 | 27.09 | 0.000000e+00 | 1.000000e+00 | 5.000000e+00 | 1.500000e+01 | 6.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.740000e+00 | 6.64 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 3.760000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 9.516000e+01 | 6.83 | 2.000000e+01 | 9.300000e+01 | 9.700000e+01 | 1.000000e+02 | 1.000000e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5468 | 0.81 | 9.730000e+00 | 0.66 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5466 | 0.81 | 9.380000e+00 | 0.96 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5488 | 0.81 | 9.820000e+00 | 0.56 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5470 | 0.81 | 9.860000e+00 | 0.53 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5491 | 0.81 | 9.600000e+00 | 0.68 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5495 | 0.81 | 9.450000e+00 | 0.76 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 4.450000e+00 | 28.06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.810000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.170000e+00 | 28.08 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.810000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.600000e-01 | 0.73 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.200000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.33 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.300000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 4968 | 0.83 | 4.900000e-01 | 0.73 | 1.000000e-02 | 1.200000e-01 | 2.800000e-01 | 5.900000e-01 | 3.065000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
From this list can be concluded that there are 28,523 values for 106 variables. there are a couple of values that contain NA, 1 of them, cleaning_fee, stands out and is one of the variables that was flagged as important in this analysis. This will be addressed in the next section. Other variables that stand out are price, cleaning_fee and extra_people. these variables are stored as character values and need to have the $-sign removed to be stored as a numeric value. Since these variables will be used further in the analyis, this also will be resolved in the next section where we clean the data and run summary statistics.
Computing summary statistics of the variables of interest
In this section we will clean the data and run summary statistics that will be visualised in the next section. First the values stored as characters are converted to numeric values and the $-sign removed.
# Make price into a numeric variable
listings$price = as.numeric(gsub("[\\$,]", "", listings$price))
# Make cleaning fee into a numeric variable
listings$cleaning_fee = as.numeric(gsub("[\\$,]", "", listings$cleaning_fee))
# Make extra people a numeric variable
listings$extra_people = as.numeric(gsub("[\\$,]", "", listings$extra_people))
Following this, the NA’s in the cleaning_fee values will be converted to zero. Assuming the NA’s are from hosts not providing an input for cleaning fees, the NA’s should equal zero.
# Change NA cleaning fee values to 0, assuming that 0 is meant for NA
listings <- listings %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee
))
The property type categorical variables has too many categories to be summarised. To reduce this, the 4 most substantial categories have been specified, the remainder will be classed as other.
# Create a list of property types and the count
listings%>%
group_by(property_type)%>%
summarise(count= n())%>%
arrange(desc(count))%>%
kable()
## `summarise()` ungrouping output (override with `.groups` argument)
| property_type | count |
|---|---|
| Apartment | 23942 |
| Condominium | 1673 |
| House | 1365 |
| Townhouse | 554 |
| Serviced apartment | 331 |
| Loft | 231 |
| Villa | 190 |
| Hostel | 28 |
| Guesthouse | 23 |
| Bungalow | 22 |
| Guest suite | 22 |
| Bed and breakfast | 21 |
| Houseboat | 21 |
| Boat | 20 |
| Hotel | 20 |
| Other | 14 |
| Tiny house | 14 |
| Cabin | 10 |
| Boutique hotel | 5 |
| Cottage | 4 |
| Hut | 3 |
| Aparthotel | 2 |
| Barn | 2 |
| Camper/RV | 1 |
| Casa particular (Cuba) | 1 |
| Earth house | 1 |
| Island | 1 |
| Lighthouse | 1 |
| Tent | 1 |
# Take top 4 property_type and create a new column and assigning property_type in groups
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","Condominium", "House","Townhouse") ~ property_type,
TRUE ~ "Other"
))
# Check if all the variables are in the correct type
skim(listings)
| Name | listings |
| Number of rows | 28523 |
| Number of columns | 107 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| Date | 5 |
| logical | 16 |
| numeric | 43 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 28523 | 0 |
| name | 57 | 1.00 | 1 | 211 | 0 | 26907 | 0 |
| summary | 1096 | 0.96 | 1 | 1000 | 0 | 26986 | 0 |
| space | 11390 | 0.60 | 1 | 1000 | 0 | 16806 | 0 |
| description | 515 | 0.98 | 1 | 1000 | 0 | 27748 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 12407 | 0.57 | 1 | 1000 | 0 | 15520 | 0 |
| notes | 20832 | 0.27 | 1 | 1000 | 0 | 7366 | 0 |
| transit | 11316 | 0.60 | 1 | 1000 | 0 | 16714 | 0 |
| access | 15751 | 0.45 | 1 | 1000 | 0 | 11251 | 0 |
| interaction | 14111 | 0.51 | 1 | 1000 | 0 | 13728 | 0 |
| house_rules | 13332 | 0.53 | 1 | 1000 | 0 | 13703 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 28282 | 0 |
| host_url | 0 | 1.00 | 37 | 43 | 0 | 25745 | 0 |
| host_name | 12 | 1.00 | 1 | 34 | 0 | 6415 | 0 |
| host_location | 93 | 1.00 | 2 | 152 | 0 | 863 | 0 |
| host_about | 14028 | 0.51 | 1 | 3550 | 0 | 12478 | 34 |
| host_response_time | 11 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 11 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 11 | 1.00 | 2 | 4 | 0 | 100 | 0 |
| host_thumbnail_url | 11 | 1.00 | 55 | 106 | 0 | 25669 | 0 |
| host_picture_url | 11 | 1.00 | 57 | 109 | 0 | 25669 | 0 |
| host_neighbourhood | 8007 | 0.72 | 1 | 21 | 0 | 56 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 316 | 0 |
| street | 0 | 1.00 | 10 | 61 | 0 | 528 | 0 |
| neighbourhood | 0 | 1.00 | 5 | 14 | 0 | 21 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 25 | 0 | 11 | 0 |
| city | 13 | 1.00 | 1 | 26 | 0 | 136 | 0 |
| state | 24042 | 0.16 | 1 | 25 | 0 | 170 | 0 |
| market | 864 | 0.97 | 6 | 21 | 0 | 10 | 0 |
| smart_location | 0 | 1.00 | 10 | 35 | 0 | 158 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 29 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 26634 | 0 |
| weekly_price | 25008 | 0.12 | 7 | 11 | 0 | 718 | 0 |
| monthly_price | 26971 | 0.05 | 7 | 11 | 0 | 476 | 0 |
| security_deposit | 13845 | 0.51 | 5 | 10 | 0 | 386 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 82 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 4 | 0 |
| prop_type_simplified | 0 | 1.00 | 5 | 11 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| host_since | 11 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-06-22 | 3379 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| first_review | 4968 | 0.83 | 2009-09-04 | 2020-06-28 | 2017-08-07 | 2582 |
| last_review | 4968 | 0.83 | 2011-08-10 | 2020-06-28 | 2019-07-29 | 1846 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 28523 | 0 | NaN | : |
| medium_url | 28523 | 0 | NaN | : |
| xl_picture_url | 28523 | 0 | NaN | : |
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| host_has_profile_pic | 11 | 1 | 1.00 | TRU: 28445, FAL: 67 |
| host_identity_verified | 11 | 1 | 0.38 | FAL: 17790, TRU: 10722 |
| neighbourhood_group_cleansed | 28523 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
| has_availability | 0 | 1 | 1.00 | TRU: 28523 |
| requires_license | 0 | 1 | 0.00 | FAL: 28523 |
| license | 28523 | 0 | NaN | : |
| jurisdiction_names | 28523 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.29 | FAL: 20360, TRU: 8163 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 28523 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 28379, TRU: 144 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 28327, TRU: 196 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.044063e+07 | 12261143.29 | 6.983000e+03 | 1.008379e+07 | 1.950297e+07 | 3.043587e+07 | 4.395635e+07 | <U+2587><U+2587><U+2587><U+2585><U+2586> |
| scrape_id | 0 | 1.00 | 2.020063e+13 | 0.00 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | 2.020063e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 6.731400e+07 | 76172316.22 | 5.130000e+02 | 1.281774e+07 | 3.604205e+07 | 9.463162e+07 | 3.519509e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 11 | 1.00 | 5.460000e+00 | 34.32 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 7.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 11 | 1.00 | 5.460000e+00 | 34.32 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 7.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 815 | 0.97 | 2.069760e+03 | 390.81 | 2.000000e+01 | 1.855750e+03 | 2.200000e+03 | 2.300000e+03 | 2.100000e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.568000e+01 | 0.02 | 5.562000e+01 | 5.567000e+01 | 5.568000e+01 | 5.570000e+01 | 5.573000e+01 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 1.256000e+01 | 0.03 | 1.245000e+01 | 1.254000e+01 | 1.255000e+01 | 1.258000e+01 | 1.264000e+01 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.320000e+00 | 1.63 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 12 | 1.00 | 1.080000e+00 | 0.28 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.550000e+00 | 1.06 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.010000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.040000e+00 | 1.44 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 2.500000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 28130 | 0.01 | 7.218900e+02 | 576.10 | 0.000000e+00 | 1.200000e+02 | 7.640000e+02 | 1.076000e+03 | 2.799000e+03 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| price | 0 | 1.00 | 8.491900e+02 | 1066.81 | 0.000000e+00 | 4.980000e+02 | 6.980000e+02 | 9.960000e+02 | 6.917500e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 0 | 1.00 | 2.103900e+02 | 249.69 | 0.000000e+00 | 0.000000e+00 | 1.500000e+02 | 3.000000e+02 | 4.000000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.520000e+00 | 1.06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 6.388000e+01 | 126.23 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+02 | 2.024000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 3.850000e+00 | 18.07 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 6.206700e+02 | 553.19 | 1.000000e+00 | 1.500000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 3.840000e+00 | 18.08 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 4.120000e+00 | 19.29 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 6.669200e+02 | 548.00 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 6.703800e+02 | 547.39 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 3.970000e+00 | 18.35 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.100000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 6.684100e+02 | 547.20 | 1.000000e+00 | 2.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 5.870000e+00 | 10.42 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 8.000000e+00 | 3.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_60 | 0 | 1.00 | 1.124000e+01 | 20.15 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.400000e+01 | 6.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_90 | 0 | 1.00 | 1.649000e+01 | 29.99 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.800000e+01 | 9.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_365 | 0 | 1.00 | 4.947000e+01 | 99.23 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.500000e+01 | 3.650000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 1.357000e+01 | 27.09 | 0.000000e+00 | 1.000000e+00 | 5.000000e+00 | 1.500000e+01 | 6.370000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.740000e+00 | 6.64 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 3.760000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 9.516000e+01 | 6.83 | 2.000000e+01 | 9.300000e+01 | 9.700000e+01 | 1.000000e+02 | 1.000000e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5468 | 0.81 | 9.730000e+00 | 0.66 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5466 | 0.81 | 9.380000e+00 | 0.96 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5488 | 0.81 | 9.820000e+00 | 0.56 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5470 | 0.81 | 9.860000e+00 | 0.53 | 2.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5491 | 0.81 | 9.600000e+00 | 0.68 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5495 | 0.81 | 9.450000e+00 | 0.76 | 2.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 4.450000e+00 | 28.06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.810000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.170000e+00 | 28.08 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.810000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.600000e-01 | 0.73 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.200000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.33 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.300000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 4968 | 0.83 | 4.900000e-01 | 0.73 | 1.000000e-02 | 1.200000e-01 | 2.800000e-01 | 5.900000e-01 | 3.065000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
To only have a list of variables that will be used in the analysis, redundant variables are filtered out. As the analysis focuses on short holiday stays, the long stay listings need to be removed from the list. By looking at the minimum nights on the listings, it can be determined if these addresses are viable for short stays.
| minimum_nights | count |
|---|---|
| 2 | 8500 |
| 3 | 6479 |
| 1 | 5444 |
| 4 | 3117 |
| 5 | 2346 |
| 7 | 997 |
| 6 | 702 |
| 14 | 198 |
| 10 | 170 |
| 30 | 136 |
| 8 | 52 |
| 20 | 49 |
| 15 | 35 |
| 21 | 33 |
| 60 | 23 |
| 9 | 22 |
| 12 | 21 |
| 90 | 21 |
| 13 | 20 |
| 25 | 16 |
| 28 | 14 |
| 50 | 14 |
| 100 | 8 |
| 31 | 7 |
| 11 | 6 |
| 23 | 6 |
| 29 | 6 |
| 45 | 5 |
| 16 | 4 |
| 19 | 4 |
| 200 | 4 |
| 1000 | 4 |
| 17 | 3 |
| 40 | 3 |
| 70 | 3 |
| 80 | 3 |
| 120 | 3 |
| 18 | 2 |
| 22 | 2 |
| 24 | 2 |
| 26 | 2 |
| 27 | 2 |
| 49 | 2 |
| 85 | 2 |
| 160 | 2 |
| 180 | 2 |
| 360 | 2 |
| 365 | 2 |
| 999 | 2 |
| 34 | 1 |
| 35 | 1 |
| 39 | 1 |
| 42 | 1 |
| 43 | 1 |
| 48 | 1 |
| 56 | 1 |
| 59 | 1 |
| 61 | 1 |
| 66 | 1 |
| 75 | 1 |
| 89 | 1 |
| 92 | 1 |
| 102 | 1 |
| 107 | 1 |
| 150 | 1 |
| 270 | 1 |
| 300 | 1 |
| 430 | 1 |
| 500 | 1 |
| 1100 | 1 |
In this summary it can be seen that after 5 minimum days the numbers of listing drop significantly.The majority of listings are within the 1 to 4 minimum days stay. Therefore this will give a good indication of the listings intended for short, holiday, stays. In the next section the data is filtered to only feature a minimum of 4 nights or less. Also, the variables that will be used in the analysis are selected, so to only have relevant variables left to make the analysis more concise and clear.
Let’s have a look at the data.
# Check if all the variables are in the correct type
skim(listings_clean)
| Name | listings_clean |
| Number of rows | 28523 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| logical | 2 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1 | 33 | 37 | 0 | 28523 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| neighbourhood_cleansed | 0 | 1 | 5 | 25 | 0 | 11 | 0 |
| cancellation_policy | 0 | 1 | 8 | 27 | 0 | 4 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 20440633.59 | 12261143.29 | 6983.00 | 10083785.00 | 19502966.00 | 30435871.00 | 43956346.00 | <U+2587><U+2587><U+2587><U+2585><U+2586> |
| price | 0 | 1.00 | 849.19 | 1066.81 | 0.00 | 498.00 | 698.00 | 996.00 | 69175.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.52 | 1.06 | 1.00 | 1.00 | 1.00 | 2.00 | 16.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 0 | 1.00 | 210.39 | 249.69 | 0.00 | 0.00 | 150.00 | 300.00 | 4000.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 63.88 | 126.23 | 0.00 | 0.00 | 0.00 | 100.00 | 2024.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 13.57 | 27.09 | 0.00 | 1.00 | 5.00 | 15.00 | 637.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 95.16 | 6.83 | 20.00 | 93.00 | 97.00 | 100.00 | 100.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| bathrooms | 12 | 1.00 | 1.08 | 0.28 | 0.00 | 1.00 | 1.00 | 1.00 | 10.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.55 | 1.06 | 0.00 | 1.00 | 1.00 | 2.00 | 101.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.04 | 1.44 | 0.00 | 1.00 | 2.00 | 3.00 | 25.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| accommodates | 0 | 1.00 | 3.32 | 1.63 | 1.00 | 2.00 | 3.00 | 4.00 | 16.00 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| longitude | 0 | 1.00 | 12.56 | 0.03 | 12.45 | 12.54 | 12.55 | 12.58 | 12.64 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| latitude | 0 | 1.00 | 55.68 | 0.02 | 55.62 | 55.67 | 55.68 | 55.70 | 55.73 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
Visualisations
Now that the data is clean, a selection of useful data has been made, and filtered to only feature our criteria for a short stay holiday listing, visualisations of data can be made. This lays the foundation of any correlations and questions the regression model will have to explain and answer.
location
In order to get an idea on the concentration of apartment listings in the area of Copenhagen, we plotted a map showing the number of flat available for rent for 4 nights or less. We used clusters instead of point to make the map more readable and to get a better idea of the number of listings in each area.
# Take the filtered listings with a min of 4 nights and create a map of the apartments in Copenhagen
listings_filtered_4nights %>%
leaflet() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type,
#To get a better overview of the number and location of apartments in Copenhagen, a clustered map was chosen instead of the original point mapping
clusterOptions = markerClusterOptions()
)
The map shows the number of available flats are located in the city center. The more we get away from the center the fewer the number of flats available.
To gauge the price difference in each area in Copenhagen, we plotted a heatmap showing the more expensive areas in a brighter red colour. Also we plotted a bar chart to get an idea of the median price for each area. The reason for using median over mean is because the listings in Copenhagen have very extreme values and they skew the mean too much for certain areas.
#maps price to seven colors using quantiles
qpal <- colorQuantile("Reds", listings_filtered_4nights$price, n = 7)
listings_filtered_4nights %>%
leaflet() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type,
color = ~qpal(price))
We can conclude that the harbour area in Copehagen has the most expensive listings. Therefore we suspect that location will have a high coefficient when we will be running te regression models.
#Calculated median price per night for each neighborhood
median_per_neighborhood <- listings_filtered_4nights %>%
group_by(neighbourhood_cleansed) %>%
summarise(median_price = median(price))
#Price per night for each neighborhood
median_per_neighborhood %>%
ggplot(aes(
x = reorder(neighbourhood_cleansed, median_price),
y = median_price
)) +
geom_col(fill = "#FF5A5F") +
scale_fill_manual(values="#FF5A5F")+
labs(
title = "Median price per night per neighborhood",
x = "",
y = "Median price per neighborhood",
caption = "Source: Airbnb"
) +
coord_flip() +
theme_classic() +
NULL
The Indre By area in Copenhagen is the central harbour area and is the most expensive area for AirBnB prices. This confirms that the neighborhood variable will have a significant impact on the price variable.
accommodation type
Next up, we have plotted a number of bar charts to display the price relationship with the types of accommodation.
Median price per night per property type: The histogram showed that among all the property types, the lowest median price per night is allocated to the appartments
Median price per room type: The histogram showed that hotel rooms are generally more expensive than the other types of rooms available.
# Calculate median price per night for each room type
median_per_proptype <- listings_filtered_4nights %>%
group_by(prop_type_simplified) %>%
summarise(median_price = median(price))
# Plot a bar chart with median price per night for each property type
median_per_proptype %>%
ggplot(aes(
x = reorder(prop_type_simplified, desc(median_price)),
y = median_price
)) +
geom_col(fill = "#FF5A5F") +
scale_fill_manual(values="#FF5A5F")+
labs(
title = "Median price per night per property type",
x = "",
y = "Median price per night",
caption = "Source: AirBnB"
) +
NULL

# Calculate median price per night for each room type
median_per_roomtype <- listings_filtered_4nights %>%
group_by(room_type) %>%
summarise(median_price = median(price))
# Price per night for each room type
median_per_roomtype %>%
ggplot(aes(
x = reorder(room_type, desc(median_price)),
y = median_price
)) +
geom_col(fill = "#FF5A5F") +
scale_fill_manual(values="#FF5A5F")+
labs(
title = "Median price per night per room type",
x = "",
y = "Median price per night",
caption = "Source: AirBnB"
) +
NULL

From this we can conclude that both room type and property type have an influence on the price.
rating
Following from this, we wanted to know what influence rating has on the price. For this we used both data on the review scores and whether the listing is posted by a superhost.
#Plotted a density graph to display the relationship with price in being a superhost
listings_clean %>%
ggplot(aes( x = log(price),
#colour the graph by being a superhost or not and filtering out the NA values
colour = !is.na(host_is_superhost))) +
geom_density() +
labs(
title = "Density plot for prices on whether host is superhost",
x = "price per night (log)",
y = "density",
caption = "Source: AirBnB",
color = "(Non)superhost\n" ) +
scale_color_manual(labels = c("Superhost", "Non superhost"),
values = c("#FF5A5F", "#00A699")) +
theme_classic() +
NULL
## Warning: Removed 2 rows containing non-finite values (stat_density).

#Plotted a histogram to display the distribution of ratings and to show if this is affected by being a superhost
listings_clean %>%
#Filtered out the NA's in superhost
filter(!is.na(host_is_superhost)) %>%
ggplot(aes(
x = review_scores_rating,
fill = host_is_superhost
)) +
# set the binwidth to 1 to match the scoring bins in the original data
geom_histogram(binwidth = 1) +
labs(
title = "Histogram for review score distribution with (non)superhosts",
x = "Review scores",
y = "Count",
caption = "Source: AirBnB"
) +
#change name of legend
scale_fill_discrete(
name = NULL,
labels = c("Non-superhost", "Superhost")
) +
theme_classic() +
NULL
## Warning: Removed 5442 rows containing non-finite values (stat_bin).

#plotted a scatter graph showing the relationship between review score and price
listings_clean %>%
ggplot(aes(
x = review_scores_rating,
y = price
)) +
geom_point() +
#Applied a logscale to make the relationship more visible
scale_y_log10() +
#Applied a linear line to display any relationship in with price with a higher review score
geom_smooth(method = lm) +
labs(
title = "Scatter plot showing relationship between review scores and price",
x = "Review scores",
y = "Price per night (log)",
caption = "Source: Airbnb"
) +
theme_classic() +
NULL
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 5448 rows containing non-finite values (stat_smooth).
## Warning: Removed 5447 rows containing missing values (geom_point).
From this we can conclude that being a superhost does have an influence on the price. People would prefer a superhost, so it seems logical that the average price for a superhost is higher than a listing under a normal host. The ratings however don’t seem to be related to being a superhost. In fact the perfect scores are more among non-superhosts. This could be because of the higher expectations set by booking via a superhost, this could make guests more critical when writing a review. finally, there does not seem to be a strong correlation with a rating score and the price. Hosts will not alter their prices to their review scores.
GGpairs
To display a comprehensive overview of the correlation between variables and the price we plotted a GGpairs plot. Using all the selected variables.
# Display the correlation between four important variables
listings_clean %>%
select(price, review_scores_rating, bedrooms, bathrooms, accommodates) %>%
ggpairs(cardinality_threshold=NULL)
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5447 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 29 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 5447 rows containing missing values (geom_point).
## Warning: Removed 5447 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5458 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5454 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5447 rows containing missing values
## Warning: Removed 29 rows containing missing values (geom_point).
## Warning: Removed 5458 rows containing missing values (geom_point).
## Warning: Removed 29 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 38 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 29 rows containing missing values
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 5454 rows containing missing values (geom_point).
## Warning: Removed 38 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 5447 rows containing missing values (geom_point).
## Warning: Removed 29 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing missing values (geom_point).

REGRESSION
Basic explanatory variables
Before we start with the regression modeling, we need to filter the data to represent 2 people staying for 4 nights, including the cleaning fee.
# Create new variable for 4 nights using price, guests_included, cleaning_fee and extra_people
listings_new <- listings_clean %>%
mutate(price_4_nights = ifelse(
guests_included <= 1,
(price + extra_people) * 4 + cleaning_fee,
(price) * 4 + cleaning_fee
))
The following plots show the density for the total price for 4 nights wjth and without a log function.
# Density plots for price_4_nights
density.default(listings_new$price_4_nights)
##
## Call:
## density.default(x = listings_new$price_4_nights)
##
## Data: listings_new$price_4_nights (28523 obs.); Bandwidth 'bw' = 173
##
## x y
## Min. : -518.9 Min. :0.000e+00
## 1st Qu.: 68915.5 1st Qu.:0.000e+00
## Median :138350.0 Median :0.000e+00
## Mean :138350.0 Mean :4.589e-06
## 3rd Qu.:207784.5 3rd Qu.:2.000e-10
## Max. :277218.9 Max. :3.565e-04
# Calculate the most frequently occurring price (mode) for 4 nights
max_p4n <- density(listings_new$price_4_nights)$x[which.max(density(listings_new$price_4_nights)$y)]
max_p4n
## [1] 2742.17
# Plot density of price_4_nights
ggplot(
listings_new,
aes(x = price_4_nights)) +
geom_density() +
# Plot vertical line to show maximum value
geom_vline(xintercept = max_p4n) +
theme_classic() +
NULL

# Plot density of log(price_4_nights)
ggplot(listings_new, aes(x = log(price_4_nights))) +
geom_density() +
theme_classic() +
NULL
## Warning: Removed 1 rows containing non-finite values (stat_density).
we should use variable log(price_4_nights). Looking at the density plots of
price_4_nights and log(price_4_nights), we can see clearly that log(price_4_nights) is more close to a normal distribution, while most values in price_4_nights are cramped within a very small range.
Model 1
Fitting a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating
# Create a new dataset with the new variable of log(price_4_nights)
listings_log <- listings_new %>%
mutate(price_4_nights_log = log10(price_4_nights))
# Create model1 of price_4_nights with explanatory variables prop_type_simplified, number_of_reviews and review_scores_ratings
model1 <- lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_log
)
model1 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.24 | 0.0198 | 164 | 0 | 3.21 | 3.28 |
| prop_type_simplifiedCondominium | 0.033 | 0.00576 | 5.74 | 9.66e-09 | 0.0218 | 0.0443 |
| prop_type_simplifiedHouse | 0.119 | 0.00693 | 17.1 | 4.35e-65 | 0.105 | 0.132 |
| prop_type_simplifiedOther | 0.178 | 0.00872 | 20.4 | 1.7e-91 | 0.161 | 0.195 |
| prop_type_simplifiedTownhouse | 0.177 | 0.0102 | 17.4 | 2.2e-67 | 0.157 | 0.197 |
| number_of_reviews | 0.000188 | 4.84e-05 | 3.89 | 0.000101 | 9.34e-05 | 0.000283 |
| review_scores_rating | 0.00235 | 0.000207 | 11.3 | 9.89e-30 | 0.00194 | 0.00276 |
model1 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.046 | 0.0457 | 0.215 | 185 | 3.13e-231 | 6 | 2.77e+03 | -5.53e+03 | -5.46e+03 | 1.06e+03 | 23069 | 23076 |
# Check residuals
autoplot(model1)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Ratings (review_scores_rating) is a significant predictor. Controlling for other variables, every 1 point increase in rating is associated with $ 0.002 increase in log10(price_4_nights).
Interpretation of prop_type_simplified predictor:
Property type (prop_type_simplified) is a significant predictor. Controlling for other variables, switching property type from apartment to condomunium would result in 0.034 increase in log10(price_4_nights). Similarly, switching property type from apartment to house would result in 0.11 increase in log10(price_4_nights); switching property type from apartment to townhouse would result in 0.17 increase in log10(price_4_nights); switching property type from apartment to property types other than the aforementioned 3 types would result in an average of 0.18 increase in log10(price_4_nights).
Model 1: 4%
Model 2
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.
# Create model2 by adding room_type to model1
model2 <- lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = listings_log
)
model2 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.36 | 0.0177 | 189 | 0 | 3.32 | 3.39 |
| prop_type_simplifiedCondominium | 0.0168 | 0.00513 | 3.27 | 0.00107 | 0.00674 | 0.0269 |
| prop_type_simplifiedHouse | 0.126 | 0.00618 | 20.4 | 1.8e-91 | 0.114 | 0.138 |
| prop_type_simplifiedOther | 0.168 | 0.00797 | 21.1 | 1.14e-97 | 0.152 | 0.184 |
| prop_type_simplifiedTownhouse | 0.166 | 0.00905 | 18.3 | 1.98e-74 | 0.148 | 0.184 |
| number_of_reviews | 0.000668 | 4.36e-05 | 15.3 | 1.33e-52 | 0.000582 | 0.000753 |
| review_scores_rating | 0.00154 | 0.000185 | 8.33 | 8.6e-17 | 0.00118 | 0.00191 |
| room_typeHotel room | 0.112 | 0.0342 | 3.28 | 0.00104 | 0.0452 | 0.179 |
| room_typePrivate room | -0.271 | 0.00355 | -76.3 | 0 | -0.278 | -0.264 |
| room_typeShared room | -0.362 | 0.0246 | -14.7 | 5.51e-49 | -0.41 | -0.314 |
model2 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.243 | 0.243 | 0.191 | 823 | 0 | 9 | 5.44e+03 | -1.09e+04 | -1.08e+04 | 843 | 23066 | 23076 |
# Check residuals
autoplot(model2)

Interpretation of room_type predictor:
The result of model2 regression shows that rooms_type is a significant predictor. Controlling for other variables, switching room type from entire house to hotel room would result in 0.11 increase in log10(price_4_nights). Nevertheless, switching from entire house to private room would result in 0.27 drop in log10(price_4_nights); switching from entire house to shared room would result in 0.38 drop in log10(price_4_nights).
Model 2: 24%
Exploring additional explanatory variables
Model 3
Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights?
# Create model3a by adding bathrooms, bedrooms, beds, accommodates to model2
model3a<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates,
data = listings_log)
model3a %>%
tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.08 | 0.0163 | 189 | 0 | 3.05 | 3.12 |
| prop_type_simplifiedCondominium | 0.0105 | 0.00452 | 2.31 | 0.0207 | 0.0016 | 0.0193 |
| prop_type_simplifiedHouse | -0.0476 | 0.00594 | -8.02 | 1.08e-15 | -0.0593 | -0.036 |
| prop_type_simplifiedOther | 0.0906 | 0.00709 | 12.8 | 2.97e-37 | 0.0767 | 0.105 |
| prop_type_simplifiedTownhouse | 0.00279 | 0.00835 | 0.334 | 0.738 | -0.0136 | 0.0191 |
| number_of_reviews | 0.000596 | 3.85e-05 | 15.5 | 7.78e-54 | 0.000521 | 0.000672 |
| review_scores_rating | 0.00166 | 0.000164 | 10.2 | 3.24e-24 | 0.00134 | 0.00199 |
| room_typeHotel room | 0.223 | 0.0302 | 7.38 | 1.65e-13 | 0.164 | 0.282 |
| room_typePrivate room | -0.18 | 0.00338 | -53.2 | 0 | -0.187 | -0.174 |
| room_typeShared room | -0.261 | 0.0217 | -12 | 3e-33 | -0.303 | -0.218 |
| bathrooms | 0.0581 | 0.0047 | 12.4 | 4.54e-35 | 0.0489 | 0.0673 |
| bedrooms | 0.0501 | 0.00224 | 22.3 | 1.83e-109 | 0.0457 | 0.0545 |
| beds | -0.00273 | 0.00125 | -2.19 | 0.0283 | -0.00517 | -0.00029 |
| accommodates | 0.0389 | 0.00131 | 29.6 | 2.69e-189 | 0.0363 | 0.0415 |
model3a %>%
glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.412 | 0.412 | 0.168 | 1.24e+03 | 0 | 13 | 8.36e+03 | -1.67e+04 | -1.66e+04 | 653 | 23021 | 23035 |
# Test VIF
car::vif(model3a)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.373951 4 1.040510
## number_of_reviews 1.029840 1 1.014810
## review_scores_rating 1.012554 1 1.006257
## room_type 1.272593 3 1.040994
## bathrooms 1.332643 1 1.154401
## bedrooms 3.071499 1 1.752569
## beds 2.550304 1 1.596967
## accommodates 3.628682 1 1.904910
# Check residuals
autoplot(model3a)

The number of bathrooms, bedrooms and size of the house (accommodates) are significant predictors of log10(price_4_nights) and are all positively related with price_4_nights, while number of beds is not a significant predictor. Coefficients show that each additional bathroom is associated with 0.06 increase in log10(price_4_nights); each additional bedroom is associated with 0.05 increase in log10(price_4_nights). There might be a problem with multi-collinearity since intuitively the number of bathrooms and bedrooms should have a positive relationship with size of the house. A further analysis on VIF shows that size of the house has slight correlation with other predictors but is within an acceptable range.
# Create model3 by removing beds from model3a
model3<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates,
data = listings_log
)
model3 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.08 | 0.0163 | 189 | 0 | 3.05 | 3.12 |
| prop_type_simplifiedCondominium | 0.0106 | 0.00452 | 2.34 | 0.0195 | 0.0017 | 0.0194 |
| prop_type_simplifiedHouse | -0.0481 | 0.00593 | -8.11 | 5.24e-16 | -0.0598 | -0.0365 |
| prop_type_simplifiedOther | 0.0905 | 0.00709 | 12.7 | 4.17e-37 | 0.0766 | 0.104 |
| prop_type_simplifiedTownhouse | 0.00227 | 0.00834 | 0.273 | 0.785 | -0.0141 | 0.0186 |
| number_of_reviews | 0.000598 | 3.85e-05 | 15.5 | 3.5e-54 | 0.000523 | 0.000674 |
| review_scores_rating | 0.00167 | 0.000164 | 10.2 | 1.7e-24 | 0.00135 | 0.00199 |
| room_typeHotel room | 0.223 | 0.0302 | 7.39 | 1.58e-13 | 0.164 | 0.282 |
| room_typePrivate room | -0.181 | 0.00338 | -53.5 | 0 | -0.187 | -0.174 |
| room_typeShared room | -0.262 | 0.0217 | -12.1 | 1.91e-33 | -0.304 | -0.219 |
| bathrooms | 0.0581 | 0.0047 | 12.4 | 4.75e-35 | 0.0489 | 0.0673 |
| bedrooms | 0.0488 | 0.00216 | 22.6 | 1.72e-111 | 0.0446 | 0.0531 |
| accommodates | 0.0376 | 0.00118 | 31.9 | 1.11e-218 | 0.0353 | 0.0399 |
model3 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.413 | 0.412 | 0.168 | 1.35e+03 | 0 | 12 | 8.36e+03 | -1.67e+04 | -1.66e+04 | 654 | 23047 | 23060 |
# Test VIF
car::vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.370590 4 1.040192
## number_of_reviews 1.029562 1 1.014673
## review_scores_rating 1.011882 1 1.005924
## room_type 1.271896 3 1.040899
## bathrooms 1.332684 1 1.154419
## bedrooms 2.862720 1 1.691957
## accommodates 2.929976 1 1.711717
# Check residuals
autoplot(model3)

Model 3: 41%
Model 4
Do superhosts command a pricing premium, after controlling for other variables?
# Create model4 by adding host_is_superhost to model3
model4<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
host_is_superhost,
data = listings_log
)
model4 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1 | 0.0164 | 189 | 0 | 3.07 | 3.13 |
| prop_type_simplifiedCondominium | 0.00988 | 0.00452 | 2.19 | 0.0287 | 0.00103 | 0.0187 |
| prop_type_simplifiedHouse | -0.0478 | 0.00592 | -8.07 | 7.47e-16 | -0.0594 | -0.0362 |
| prop_type_simplifiedOther | 0.089 | 0.00708 | 12.6 | 4.07e-36 | 0.0751 | 0.103 |
| prop_type_simplifiedTownhouse | 0.00141 | 0.00833 | 0.17 | 0.865 | -0.0149 | 0.0177 |
| number_of_reviews | 0.000484 | 4.01e-05 | 12.1 | 2.24e-33 | 0.000405 | 0.000562 |
| review_scores_rating | 0.00145 | 0.000165 | 8.81 | 1.28e-18 | 0.00113 | 0.00178 |
| room_typeHotel room | 0.226 | 0.0301 | 7.49 | 6.91e-14 | 0.167 | 0.285 |
| room_typePrivate room | -0.183 | 0.00338 | -54.1 | 0 | -0.189 | -0.176 |
| room_typeShared room | -0.261 | 0.0216 | -12.1 | 2.06e-33 | -0.304 | -0.219 |
| bathrooms | 0.0582 | 0.00469 | 12.4 | 3.14e-35 | 0.049 | 0.0673 |
| bedrooms | 0.0491 | 0.00216 | 22.7 | 2.89e-113 | 0.0449 | 0.0534 |
| accommodates | 0.0374 | 0.00118 | 31.8 | 9.41e-217 | 0.0351 | 0.0397 |
| host_is_superhostTRUE | 0.0354 | 0.00358 | 9.88 | 5.54e-23 | 0.0284 | 0.0424 |
model4 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.415 | 0.415 | 0.168 | 1.26e+03 | 0 | 13 | 8.4e+03 | -1.68e+04 | -1.67e+04 | 651 | 23040 | 23054 |
# Test VIF
car::vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.371707 4 1.040298
## number_of_reviews 1.122623 1 1.059539
## review_scores_rating 1.028858 1 1.014326
## room_type 1.277660 3 1.041684
## bathrooms 1.332709 1 1.154430
## bedrooms 2.863006 1 1.692042
## accommodates 2.930492 1 1.711868
## host_is_superhost 1.122152 1 1.059317
# Check residuals
autoplot(model4)

Interpretation of superhost predictor:
Since the p-value for the estimate of the superhost coefficient is significant (with a p-value of 0.36 and a t-statistic of 0.9), it seems that superhosts do command a price premium from their guests in Copenhagen. Compared to non-superhost, being a superhost is associated with 0.04 increase in log10(price_4_nights).
Since the p-value for the estimate of the superhost coefficient is significant (with a p-value of 0.36 and a t-statistic of 0.9), it seems that superhosts do command a price premium from their guests in Copenhagen. Compared to non-superhost, being a superhost is associated with 0.03 increase in log10(price_4_nights).
# Create model4b by adding host_is_superhost to model4 and create an interaction variable
model4b<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
host_is_superhost +
host_is_superhost*review_scores_rating,
data = listings_log
)
model4b %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.11 | 0.0165 | 188 | 0 | 3.07 | 3.14 |
| prop_type_simplifiedCondominium | 0.00979 | 0.00452 | 2.17 | 0.0301 | 0.000945 | 0.0186 |
| prop_type_simplifiedHouse | -0.0477 | 0.00592 | -8.06 | 8.12e-16 | -0.0593 | -0.0361 |
| prop_type_simplifiedOther | 0.0891 | 0.00708 | 12.6 | 3.59e-36 | 0.0752 | 0.103 |
| prop_type_simplifiedTownhouse | 0.0015 | 0.00833 | 0.18 | 0.857 | -0.0148 | 0.0178 |
| number_of_reviews | 0.000489 | 4.02e-05 | 12.2 | 5.72e-34 | 0.00041 | 0.000568 |
| review_scores_rating | 0.00141 | 0.000166 | 8.45 | 3.12e-17 | 0.00108 | 0.00173 |
| room_typeHotel room | 0.228 | 0.0301 | 7.55 | 4.39e-14 | 0.169 | 0.287 |
| room_typePrivate room | -0.183 | 0.00338 | -54.1 | 0 | -0.19 | -0.176 |
| room_typeShared room | -0.261 | 0.0216 | -12.1 | 1.78e-33 | -0.304 | -0.219 |
| bathrooms | 0.058 | 0.00469 | 12.4 | 5.16e-35 | 0.0488 | 0.0672 |
| bedrooms | 0.0491 | 0.00216 | 22.7 | 2.95e-113 | 0.0449 | 0.0534 |
| accommodates | 0.0374 | 0.00118 | 31.8 | 5.54e-217 | 0.0351 | 0.0397 |
| host_is_superhostTRUE | -0.242 | 0.122 | -1.99 | 0.0471 | -0.481 | -0.00306 |
| review_scores_rating:host_is_superhostTRUE | 0.00285 | 0.00125 | 2.28 | 0.0228 | 0.000396 | 0.0053 |
model4b %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.415 | 0.415 | 0.168 | 1.17e+03 | 0 | 14 | 8.41e+03 | -1.68e+04 | -1.67e+04 | 651 | 23039 | 23054 |
# Test VIF
car::vif(model4b)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.371877 4 1.040314
## number_of_reviews 1.126391 1 1.061316
## review_scores_rating 1.046224 1 1.022851
## room_type 1.278764 3 1.041834
## bathrooms 1.333108 1 1.154603
## bedrooms 2.863011 1 1.692043
## accommodates 2.930623 1 1.711906
## host_is_superhost 1301.822960 1 36.080784
## review_scores_rating:host_is_superhost 1301.627193 1 36.078071
# Check residuals
autoplot(model4b)

Interaction variable host_is_superhost*review_scores_rating is not significant, therefore we do not include it.
Model 4: 41%
Model 5
Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?
# Create model5 by adding is_location_exact == TRUE to model4
model5<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
host_is_superhost +
is_location_exact,
data = listings_log
)
model5 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.11 | 0.0165 | 188 | 0 | 3.07 | 3.14 |
| prop_type_simplifiedCondominium | 0.00975 | 0.00452 | 2.16 | 0.0308 | 0.000903 | 0.0186 |
| prop_type_simplifiedHouse | -0.0478 | 0.00592 | -8.07 | 7.27e-16 | -0.0594 | -0.0362 |
| prop_type_simplifiedOther | 0.0884 | 0.00709 | 12.5 | 1.36e-35 | 0.0745 | 0.102 |
| prop_type_simplifiedTownhouse | 0.00103 | 0.00833 | 0.123 | 0.902 | -0.0153 | 0.0174 |
| number_of_reviews | 0.000485 | 4.01e-05 | 12.1 | 1.71e-33 | 0.000406 | 0.000563 |
| review_scores_rating | 0.00146 | 0.000165 | 8.85 | 9.22e-19 | 0.00114 | 0.00178 |
| room_typeHotel room | 0.227 | 0.0301 | 7.52 | 5.6e-14 | 0.168 | 0.286 |
| room_typePrivate room | -0.183 | 0.00338 | -54.1 | 0 | -0.19 | -0.176 |
| room_typeShared room | -0.261 | 0.0216 | -12.1 | 2.35e-33 | -0.303 | -0.218 |
| bathrooms | 0.0582 | 0.00469 | 12.4 | 2.72e-35 | 0.049 | 0.0674 |
| bedrooms | 0.0491 | 0.00216 | 22.7 | 2.79e-113 | 0.0449 | 0.0534 |
| accommodates | 0.0374 | 0.00118 | 31.8 | 3.56e-217 | 0.0351 | 0.0397 |
| host_is_superhostTRUE | 0.0356 | 0.00358 | 9.95 | 2.93e-23 | 0.0286 | 0.0426 |
| is_location_exactTRUE | -0.00547 | 0.00277 | -1.97 | 0.0484 | -0.0109 | -3.73e-05 |
model5 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.415 | 0.415 | 0.168 | 1.17e+03 | 0 | 14 | 8.41e+03 | -1.68e+04 | -1.67e+04 | 651 | 23039 | 23054 |
# Test VIF
car::vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.375094 4 1.040619
## number_of_reviews 1.122768 1 1.059607
## review_scores_rating 1.029247 1 1.014518
## room_type 1.278044 3 1.041736
## bathrooms 1.332750 1 1.154448
## bedrooms 2.863006 1 1.692042
## accommodates 2.931407 1 1.712135
## host_is_superhost 1.123547 1 1.059975
## is_location_exact 1.005398 1 1.002695
# Check residuals
autoplot(model5)

A listing’s exact location variable has a p value of 0.34, and therefore is not a significant predictor of price_4_nights. Thus, we will not include it in the following models.
Model 4: 41%
Model 6
In order to reduce the number of neighbourhoods for our analysis we cluster the different neighbourhoods into groups, based on our experience, talking to locals, and research.
The city of Copenhagen has 10 official administrative districts. Additionally, there is Frederiksberg, which is an independent municipality and, thus, separate from the Copenhagen Municipality, however, it is still part of Copenhagen city.
Indre By is the city center of Copenhagen and will remain its own cluster “Center”.
All the residential neighbourhoods surrounding the city center are commonly grouped together, and since they all end in -bro, often referred to as -bro districts (Brokvaterer in Danish). Frederiksberg is often also included in the -bro districts. Hence, we will cluster these 4 Brokvaterer districts into one group of Brokvarterer
Additionally, the two Amager districts (Vest and Ost) are grouped together.
In the West, Valby, Vanlose and Brønshøj-Husum are summarized.
Lastly, Bispebjerg, often referred to as Nordvest, which is more of a residential neighbourhood, is its own cluster.
# Create variable neighbourhood_simplified with 5 categories
listings_log <- listings_log %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c("Indre By") ~ "Center",
neighbourhood_cleansed %in% c("Frederiksberg","Nrrebro", "sterbro","Vesterbro-Kongens Enghave") ~ "Brokvarterer",
neighbourhood_cleansed %in% c("Amager st","Amager Vest") ~ "Amager",
neighbourhood_cleansed %in% c("Brnshj-Husum","Valby", "Vanlose") ~ "West",
TRUE ~ "Nordvest"
))
# Create model6 by adding neighbourhood_simplified to model3
model6<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
neighbourhood_simplified +
host_is_superhost+
NULL,
data = listings_log
)
model6 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1 | 0.0154 | 202 | 0 | 3.07 | 3.13 |
| prop_type_simplifiedCondominium | 0.0143 | 0.00418 | 3.42 | 0.000622 | 0.00611 | 0.0225 |
| prop_type_simplifiedHouse | 0.0189 | 0.0057 | 3.32 | 0.000915 | 0.00773 | 0.0301 |
| prop_type_simplifiedOther | 0.091 | 0.00657 | 13.9 | 1.83e-43 | 0.0781 | 0.104 |
| prop_type_simplifiedTownhouse | 0.0141 | 0.00775 | 1.82 | 0.0689 | -0.00109 | 0.0293 |
| number_of_reviews | 0.000231 | 3.73e-05 | 6.2 | 5.92e-10 | 0.000158 | 0.000304 |
| review_scores_rating | 0.00132 | 0.000153 | 8.61 | 7.6e-18 | 0.00102 | 0.00162 |
| room_typeHotel room | 0.139 | 0.0279 | 4.97 | 6.9e-07 | 0.0839 | 0.193 |
| room_typePrivate room | -0.175 | 0.00313 | -55.9 | 0 | -0.181 | -0.169 |
| room_typeShared room | -0.246 | 0.02 | -12.3 | 1.46e-34 | -0.285 | -0.207 |
| bathrooms | 0.0541 | 0.00434 | 12.5 | 1.51e-35 | 0.0456 | 0.0626 |
| bedrooms | 0.0466 | 0.002 | 23.3 | 1.26e-118 | 0.0427 | 0.0505 |
| accommodates | 0.0364 | 0.00109 | 33.4 | 1.2e-238 | 0.0343 | 0.0385 |
| neighbourhood_simplifiedBrokvarterer | 0.0274 | 0.00295 | 9.29 | 1.63e-20 | 0.0216 | 0.0331 |
| neighbourhood_simplifiedCenter | 0.161 | 0.00383 | 42.1 | 0 | 0.154 | 0.169 |
| neighbourhood_simplifiedNordvest | -0.0911 | 0.00451 | -20.2 | 3.76e-90 | -0.0999 | -0.0823 |
| neighbourhood_simplifiedWest | -0.0836 | 0.005 | -16.7 | 1.71e-62 | -0.0934 | -0.0738 |
| host_is_superhostTRUE | 0.0311 | 0.00331 | 9.4 | 5.85e-21 | 0.0246 | 0.0376 |
model6 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 0.499 | 0.155 | 1.35e+03 | 0 | 17 | 1.02e+04 | -2.04e+04 | -2.02e+04 | 557 | 23036 | 23054 |
# Test VIF
car::vif(model6)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.501568 4 1.052127
## number_of_reviews 1.136203 1 1.065928
## review_scores_rating 1.030892 1 1.015329
## room_type 1.286594 3 1.042894
## bathrooms 1.333482 1 1.154765
## bedrooms 2.867499 1 1.693369
## accommodates 2.937274 1 1.713848
## neighbourhood_simplified 1.151978 4 1.017842
## host_is_superhost 1.123124 1 1.059775
# Check residuals
autoplot(model6)

One can see that the neighbourhood of an apartment is an predictor of the price for 4 nights in Copenhagen.
Interpretation of neighbourhood predictor:
We can see that all the neighbourhood variables are significant predictors of the price for 4 nights in Copenhagen. Taking Amager as the base, Brokvarterer is asscociated with 0.03 more in log10(price_4_nights); Center is asscociated with 0.16 more in log10(price_4_nights); Nordvest is asscociated with 0.09 drop in log10(price_4_nights); West is asscociated with 0.09 drop in log10(price_4_nights).
Model 6: 50%
Model 7
What is the effect of cancellation_policy on price_4_nights, after we control for other variables?
# Create model7 by adding cancellation_policy to model6
model7<-lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
neighbourhood_simplified +
cancellation_policy+
host_is_superhost+NULL,
data = listings_log
)
model7 %>% tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.09 | 0.0153 | 202 | 0 | 3.06 | 3.12 |
| prop_type_simplifiedCondominium | 0.0124 | 0.00415 | 2.99 | 0.00284 | 0.00425 | 0.0205 |
| prop_type_simplifiedHouse | 0.0189 | 0.00566 | 3.35 | 0.000822 | 0.00784 | 0.03 |
| prop_type_simplifiedOther | 0.0734 | 0.00712 | 10.3 | 6.64e-25 | 0.0595 | 0.0874 |
| prop_type_simplifiedTownhouse | 0.0125 | 0.0077 | 1.63 | 0.104 | -0.00257 | 0.0276 |
| number_of_reviews | 0.00017 | 3.73e-05 | 4.57 | 4.99e-06 | 9.73e-05 | 0.000244 |
| review_scores_rating | 0.00131 | 0.000152 | 8.61 | 7.72e-18 | 0.00101 | 0.00161 |
| room_typeHotel room | 0.175 | 0.0279 | 6.29 | 3.18e-10 | 0.121 | 0.23 |
| room_typePrivate room | -0.17 | 0.00312 | -54.7 | 0 | -0.177 | -0.164 |
| room_typeShared room | -0.244 | 0.0199 | -12.3 | 1.19e-34 | -0.283 | -0.206 |
| bathrooms | 0.0532 | 0.00431 | 12.3 | 7.16e-35 | 0.0447 | 0.0616 |
| bedrooms | 0.0455 | 0.00199 | 22.9 | 1.16e-114 | 0.0416 | 0.0494 |
| accommodates | 0.0359 | 0.00108 | 33.2 | 1.05e-235 | 0.0338 | 0.038 |
| neighbourhood_simplifiedBrokvarterer | 0.0266 | 0.00292 | 9.11 | 9.1e-20 | 0.0209 | 0.0324 |
| neighbourhood_simplifiedCenter | 0.158 | 0.00381 | 41.5 | 0 | 0.151 | 0.166 |
| neighbourhood_simplifiedNordvest | -0.0898 | 0.00447 | -20.1 | 5.59e-89 | -0.0986 | -0.0811 |
| neighbourhood_simplifiedWest | -0.0805 | 0.00496 | -16.2 | 6.35e-59 | -0.0903 | -0.0708 |
| cancellation_policymoderate | 0.0269 | 0.00241 | 11.2 | 7.93e-29 | 0.0222 | 0.0316 |
| cancellation_policystrict_14_with_grace_period | 0.0457 | 0.00262 | 17.4 | 9.54e-68 | 0.0405 | 0.0508 |
| cancellation_policysuper_strict_60 | 0.127 | 0.0173 | 7.36 | 1.88e-13 | 0.0935 | 0.161 |
| host_is_superhostTRUE | 0.0294 | 0.00329 | 8.94 | 4.11e-19 | 0.023 | 0.0359 |
model7 %>% glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.507 | 0.507 | 0.154 | 1.19e+03 | 0 | 20 | 1.04e+04 | -2.07e+04 | -2.05e+04 | 548 | 23033 | 23054 |
# Test VIF
car::vif(model7)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.793107 4 1.075724
## number_of_reviews 1.153751 1 1.074128
## review_scores_rating 1.037877 1 1.018762
## room_type 1.312570 3 1.046374
## bathrooms 1.335585 1 1.155675
## bedrooms 2.871572 1 1.694571
## accommodates 2.943557 1 1.715680
## neighbourhood_simplified 1.161450 4 1.018885
## cancellation_policy 1.268876 3 1.040487
## host_is_superhost 1.124756 1 1.060545
# Check residuals
autoplot(model7)

The cancellation policy is a significant predictor on price, with a flexible policy commanding the lowest price and with a strict policy commanding the highest price. This is surprising since usually hosts should be rewarded for the risk that they are taking with a flexible cancellation policy. On the other hand, usually expensive and nice places have a stricter cancellation policy since the cost of having the place empty for a night is relatively higher. Thus, in the end the cancellation policy might rather be a consequence of the price and not a predictor of it.
Model 7: 51%
PREDICTION
Selecting the best model
After creating all these models, we now find to find our best model, our final model for predictions.
| names | model1 | model2 | model3 | model4 | model5 | model6 | model7 | |
|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
||
| 1 | (Intercept) | 3.24392507415685 | 3.35597346847575 | 3.08347857081727 | 3.10248901379474 | 3.10605946777837 | 3.10196665923745 | 3.08645813658736 |
| 2 | (0.019785504271002) | (0.0177105018892755) | (0.016276208523874) | (0.0163893980412307) | (0.0164879271515536) | (0.0153690108711702) | (0.0153018352265915) | |
| 3 | prop_type_simplifiedCondominium | 0.0330445977032774 | 0.0168016350317635 | 0.0105699548454743 | 0.00987868517014782 | 0.00975345382308558 | 0.0143000030955854 | 0.0123851550009953 |
| 4 | (0.00575813618828062) | (0.00513368778722317) | (0.00452377311003556) | (0.00451528451130815) | (0.00451544677657339) | (0.00417845159775206) | (0.00414897595879142) | |
| 5 | prop_type_simplifiedHouse | 0.118500630947505 | 0.125853472231933 | -0.048122795979808 | -0.0477713331755052 | -0.0477875430095159 | 0.0188972706992627 | 0.0189301706164228 |
| 6 | (0.00693343909548727) | (0.00617700475016833) | (0.00593246345892025) | (0.00592078546674152) | (0.00592041924862936) | (0.00569917443777188) | (0.00565789372167747) | |
| 7 | prop_type_simplifiedOther | 0.177630584440358 | 0.167906265738652 | 0.0904593812783645 | 0.0890228067608967 | 0.0884173313350716 | 0.0909957339198253 | 0.0734156433999879 |
| 8 | (0.00871702130046473) | (0.00796730372292408) | (0.00709495502466223) | (0.00708229356288155) | (0.00708849112645184) | (0.00656888883106142) | (0.00711550292038246) | |
| 9 | prop_type_simplifiedTownhouse | 0.176808408361145 | 0.165831841966319 | 0.00227446983316441 | 0.00141494263047994 | 0.00102840247149408 | 0.0141011413539161 | 0.012511462281442 |
| 10 | (0.0101610530565199) | (0.00905255104294122) | (0.0083436073381939) | (0.00832741027414538) | (0.00832919038409289) | (0.0077521651472995) | (0.0076953742155565) | |
| 11 | number_of_reviews | 0.000188330201480727 | 0.000667782655285451 | 0.000598193401154432 | 0.000483693533257583 | 0.000484593393274765 | 0.000231256455875061 | 0.000170461188979712 |
| 12 | (4.84422194236793e-05) | (4.36385685344027e-05) | (3.8493828208453e-05) | (4.01164598762268e-05) | (4.01165312963734e-05) | (3.73285994729104e-05) | (3.73320152471178e-05) | |
| 13 | review_scores_rating | 0.00235009643422962 | 0.0015423644383559 | 0.00167274241022833 | 0.00145496933716715 | 0.00146129812121523 | 0.00131618239499806 | 0.00131040246962515 |
| 14 | (0.000207227653216638) | (0.000185179019346473) | (0.000163563398183893) | (0.000165063256670083) | (0.000165084038411034) | (0.000152822530062709) | (0.00015218268306063) | |
| 15 | room_typeHotel room | 0.112285988624123 | 0.22301927896014 | 0.225869771051116 | 0.226710460119289 | 0.13865058802833 | 0.175495746351611 | |
| 16 | (0.0342344822479748) | (0.0301985215196532) | (0.0301395309019716) | (0.0301406480832505) | (0.0279217966010478) | (0.0278891452739544) | ||
| 17 | room_typePrivate room | -0.270789209925544 | -0.180623782503831 | -0.182832632444119 | -0.182882692122035 | -0.175125354598853 | -0.17048232205997 | |
| 18 | (0.0035501721852792) | (0.00337763384528402) | (0.00337940654477376) | (0.00337928946750577) | (0.00313097759430856) | (0.00311832460795075) | ||
| 19 | room_typeShared room | -0.362231963884556 | -0.261799894388423 | -0.261145982744209 | -0.260898244177251 | -0.245936903474211 | -0.244472972454319 | |
| 20 | (0.0245657858841296) | (0.0216896324897146) | (0.0216465638670587) | (0.0216455681370037) | (0.0200246027216212) | (0.0198787022168316) | ||
| 21 | bathrooms | 0.0581148232975267 | 0.0581555336903183 | 0.0582072586519383 | 0.0540608414234523 | 0.0531546451526248 | ||
| 22 | (0.00469701514858695) | (0.00468769789531073) | (0.0046874767102729) | (0.00433704214934734) | (0.00430771960741996) | |||
| 23 | bedrooms | 0.0488418378223484 | 0.0491489259122079 | 0.0491493594027497 | 0.0465903045220969 | 0.0454694460632258 | ||
| 24 | (0.00216494638308704) | (0.00216094267679201) | (0.00216080694777852) | (0.00200028455541194) | (0.00198660539826041) | |||
| 25 | accommodates | 0.0376285344723029 | 0.037391132785231 | 0.0374321814646748 | 0.0363873262506188 | 0.0359206096397687 | ||
| 26 | (0.00117922096396024) | (0.00117713781811812) | (0.00117724763518589) | (0.00109002725321269) | (0.0010829613387494) | |||
| 27 | host_is_superhostTRUE | 0.0353701343498818 | 0.035619157372872 | 0.0311389320195054 | 0.0294131967812591 | |||
| 28 | (0.00357928311454013) | (0.0035812818989754) | (0.00331201322816401) | (0.00328941708379988) | ||||
| 29 | is_location_exactTRUE | -0.00546855444066015 | ||||||
| 30 | (0.00277095811418745) | |||||||
| 31 | neighbourhood_simplifiedBrokvarterer | 0.0273693670752078 | 0.026626715979707 | |||||
| 32 | (0.0029452620587076) | (0.0029236149180782) | ||||||
| 33 | neighbourhood_simplifiedCenter | 0.161454102535627 | 0.15808924854407 | |||||
| 34 | (0.00383407291109753) | (0.00381272000539777) | ||||||
| 35 | neighbourhood_simplifiedNordvest | -0.0911168532312485 | -0.0898351964873918 | |||||
| 36 | (0.00450570657108232) | (0.00447238366852371) | ||||||
| 37 | neighbourhood_simplifiedWest | -0.0835991898153753 | -0.0805274835870511 | |||||
| 38 | (0.00499555945047542) | (0.00496098639188035) | ||||||
| 39 | cancellation_policymoderate | 0.026915441623387 | ||||||
| 40 | (0.00241262985254636) | |||||||
| 41 | cancellation_policystrict_14_with_grace_period | 0.045671774015651 | ||||||
| 42 | (0.00261742648723956) | |||||||
| 43 | cancellation_policysuper_strict_60 | 0.127383509237797 | ||||||
| 44 | (0.0173043275909361) | |||||||
| 1.1 | #observations | 23076 | 23076 | 23060 | 23054 | 23054 | 23054 | 23054 |
| 2.1 | R squared | 0.0459531953111049 | 0.243048634204171 | 0.412790992447907 | 0.415033368314356 | 0.415132241656385 | 0.4996523643904 | 0.507236552653042 |
| 3.1 | Adj. R Squared | 0.0457050579480577 | 0.242753283372117 | 0.412485247314457 | 0.414703309016964 | 0.414776837836046 | 0.499283120172421 | 0.506808676607936 |
| 4.1 | Residual SE | 0.214628235815306 | 0.191189591060171 | 0.168435142562142 | 0.168097971640982 | 0.168087412523062 | 0.15547857615436 | 0.154305765149136 |
Model 7 is the best model with the highest R2 so this will be the final model.
Final model
Let’s print the final model (no. 7).
final_model <- lm(price_4_nights_log ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
accommodates +
neighbourhood_simplified +
cancellation_policy
+host_is_superhost,
data = listings_log
)
mosaic::msummary(final_model)
## Estimate Std. Error t value
## (Intercept) 3.086e+00 1.530e-02 201.705
## prop_type_simplifiedCondominium 1.239e-02 4.149e-03 2.985
## prop_type_simplifiedHouse 1.893e-02 5.658e-03 3.346
## prop_type_simplifiedOther 7.342e-02 7.116e-03 10.318
## prop_type_simplifiedTownhouse 1.251e-02 7.695e-03 1.626
## number_of_reviews 1.705e-04 3.733e-05 4.566
## review_scores_rating 1.310e-03 1.522e-04 8.611
## room_typeHotel room 1.755e-01 2.789e-02 6.293
## room_typePrivate room -1.705e-01 3.118e-03 -54.671
## room_typeShared room -2.445e-01 1.988e-02 -12.298
## bathrooms 5.315e-02 4.308e-03 12.339
## bedrooms 4.547e-02 1.987e-03 22.888
## accommodates 3.592e-02 1.083e-03 33.169
## neighbourhood_simplifiedBrokvarterer 2.663e-02 2.924e-03 9.107
## neighbourhood_simplifiedCenter 1.581e-01 3.813e-03 41.464
## neighbourhood_simplifiedNordvest -8.984e-02 4.472e-03 -20.087
## neighbourhood_simplifiedWest -8.053e-02 4.961e-03 -16.232
## cancellation_policymoderate 2.692e-02 2.413e-03 11.156
## cancellation_policystrict_14_with_grace_period 4.567e-02 2.617e-03 17.449
## cancellation_policysuper_strict_60 1.274e-01 1.730e-02 7.361
## host_is_superhostTRUE 2.941e-02 3.289e-03 8.942
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.002838 **
## prop_type_simplifiedHouse 0.000822 ***
## prop_type_simplifiedOther < 2e-16 ***
## prop_type_simplifiedTownhouse 0.103997
## number_of_reviews 4.99e-06 ***
## review_scores_rating < 2e-16 ***
## room_typeHotel room 3.18e-10 ***
## room_typePrivate room < 2e-16 ***
## room_typeShared room < 2e-16 ***
## bathrooms < 2e-16 ***
## bedrooms < 2e-16 ***
## accommodates < 2e-16 ***
## neighbourhood_simplifiedBrokvarterer < 2e-16 ***
## neighbourhood_simplifiedCenter < 2e-16 ***
## neighbourhood_simplifiedNordvest < 2e-16 ***
## neighbourhood_simplifiedWest < 2e-16 ***
## cancellation_policymoderate < 2e-16 ***
## cancellation_policystrict_14_with_grace_period < 2e-16 ***
## cancellation_policysuper_strict_60 1.88e-13 ***
## host_is_superhostTRUE < 2e-16 ***
##
## Residual standard error: 0.1543 on 23033 degrees of freedom
## (5469 observations deleted due to missingness)
## Multiple R-squared: 0.5072, Adjusted R-squared: 0.5068
## F-statistic: 1185 on 20 and 23033 DF, p-value: < 2.2e-16
get_regression_table(final_model)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.09 | 0.015 | 202 | 0 | 3.06 | 3.12 |
| prop_type_simplifiedCondominium | 0.012 | 0.004 | 2.98 | 0.003 | 0.004 | 0.021 |
| prop_type_simplifiedHouse | 0.019 | 0.006 | 3.35 | 0.001 | 0.008 | 0.03 |
| prop_type_simplifiedOther | 0.073 | 0.007 | 10.3 | 0 | 0.059 | 0.087 |
| prop_type_simplifiedTownhouse | 0.013 | 0.008 | 1.63 | 0.104 | -0.003 | 0.028 |
| number_of_reviews | 0 | 0 | 4.57 | 0 | 0 | 0 |
| review_scores_rating | 0.001 | 0 | 8.61 | 0 | 0.001 | 0.002 |
| room_typeHotel room | 0.175 | 0.028 | 6.29 | 0 | 0.121 | 0.23 |
| room_typePrivate room | -0.17 | 0.003 | -54.7 | 0 | -0.177 | -0.164 |
| room_typeShared room | -0.244 | 0.02 | -12.3 | 0 | -0.283 | -0.206 |
| bathrooms | 0.053 | 0.004 | 12.3 | 0 | 0.045 | 0.062 |
| bedrooms | 0.045 | 0.002 | 22.9 | 0 | 0.042 | 0.049 |
| accommodates | 0.036 | 0.001 | 33.2 | 0 | 0.034 | 0.038 |
| neighbourhood_simplifiedBrokvarterer | 0.027 | 0.003 | 9.11 | 0 | 0.021 | 0.032 |
| neighbourhood_simplifiedCenter | 0.158 | 0.004 | 41.5 | 0 | 0.151 | 0.166 |
| neighbourhood_simplifiedNordvest | -0.09 | 0.004 | -20.1 | 0 | -0.099 | -0.081 |
| neighbourhood_simplifiedWest | -0.081 | 0.005 | -16.2 | 0 | -0.09 | -0.071 |
| cancellation_policymoderate | 0.027 | 0.002 | 11.2 | 0 | 0.022 | 0.032 |
| cancellation_policystrict_14_with_grace_period | 0.046 | 0.003 | 17.4 | 0 | 0.041 | 0.051 |
| cancellation_policysuper_strict_60 | 0.127 | 0.017 | 7.36 | 0 | 0.093 | 0.161 |
| host_is_superhostTRUE | 0.029 | 0.003 | 8.94 | 0 | 0.023 | 0.036 |
get_regression_summaries(final_model)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.507 | 0.507 | 0.0238 | 0.154 | 0.154 | 1.19e+03 | 0 | 20 | 2.31e+04 |
vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.793107 4 1.075724
## number_of_reviews 1.153751 1 1.074128
## review_scores_rating 1.037877 1 1.018762
## room_type 1.312570 3 1.046374
## bathrooms 1.335585 1 1.155675
## bedrooms 2.871572 1 1.694571
## accommodates 2.943557 1 1.715680
## neighbourhood_simplified 1.161450 4 1.018885
## cancellation_policy 1.268876 3 1.040487
## host_is_superhost 1.124756 1 1.060545
Analysis of Coefficients in Final Model
Property Type Apartments have the lowest price. Compared to an apartment, a town house has a 1.2% higher price, a house a 1.9% higher price and a condominium a 1.2% higher price.“Other” property types have a 7.6% higher price than apartments, which makes sense since it includes special properties such as house boats etc.
Number of Reviews Already one additional review on AirBnB increases the price of the place 0.02%
Review_scores_rating For every increase in the review score rating, the price rises by 0.13%
Room Type A shared room and private room will have the lowest prices, with 21.69% and 15.6% respectively less than an entire home. This is intuitive since you have to share rooms and/or share facilities, such as bathroom and kitchen with other people. Hotel rooms have the highest price with around 19% more than an entire home.
Bathrooms & Bedrooms The number of bathrooms and bedrooms also have an influence on the price. An additional bathroom will increase the price by 5.4% and an additional bedroom by 4.6%. (both by roughly 5%).
Accommodates Offering an additional accommodate rises the price by 3.6%
Superhost Being a super host can increase your price by 2.7%, so it might be worth it to make to superhost!
Neighbourhoods Looking at the neighbourhoods, we can also see clear trends there. Highest prices are found in the center, which is the case in most cities. There are 17.2% higher than in Amager. Prices in Brokvarterer are 2.7% higher than in Amager. The lowest prices are in Nordvest and the West (8.6% and 7.7% lower than in Amager), which are more residential areas.
Cancellation Policy Places with flexible cancellation policies have the lowest prices, those with moderate ones are 2.7% higher and those with strict ones 4.9%. The super strict ones have the highest prices with 13.5%
In conclusion, this model can help AirBnB hosts understand what drives AirBnB prices, e.g. which in neighborhoods one can demand higher prices, and learn which features customer accept to pay more for .
Prediction
We are planning to visit Copenhagen and want to stay in an AirBnB. We would like to look at AirBnB’s that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90.
We will predict the total cost to stay at this AirBnB for 4 nights, including the appropriate 95% interval with the prediction.
# Filter our the data by room_type, nr of reviews and rating
listings_predict_log <- listings_log %>%
filter(room_type == "Private room",
number_of_reviews >= 10,
review_scores_rating >= 90,
accommodates >= 2
)
# Generate a prediction for each row
model_predictions_log_CI <- predict(final_model, newdata = listings_predict_log, interval = "confidence")
# Generate a prediction for each row with broom
model_predictions_log <- broom::augment(final_model,
newdata = listings_predict_log, se_fit = TRUE)
# Convert prices back from log
model_predictions <- model_predictions_log %>%
mutate(lower = .fitted - 1.96 * .se.fit,
upper = .fitted + 1.96 * .se.fit) %>%
mutate(fitted_price_final = 10^(.fitted),
upper_final = 10^(upper),
lower_final = 10^(lower))
Next we visualise the data to have a better understanding of the structure.
# Change order of property types to have other in the end
model_predictions$prop_type_simplified <- factor(model_predictions$prop_type_simplified, c("Apartment", "Condominium", "House", "Townhouse", "Other"))
# Graph distribution of fitted values
ggplot(model_predictions, aes(x = fitted_price_final)) +
geom_density() +
labs(y = "",
x = "Price of an AirBnB in Copenhagen") +
theme_classic() +
NULL

# Facet by neighborhood
ggplot(model_predictions, aes(x = fitted_price_final)) +
geom_density() +
labs(y = "",
x = "Price of an AirBnB in Copenhagen") +
facet_wrap(~neighbourhood_simplified) +
theme_classic() +
NULL

# Facet by prop_type_simplified
ggplot(model_predictions, aes(x = fitted_price_final)) +
geom_density() +
labs(y = "",
x = "Price of an AirBnB in Copenhagen") +
facet_wrap(~prop_type_simplified) +
theme_classic() +
NULL

# Facet by neighborhood & prop_type_simplified
ggplot(model_predictions, aes(x = fitted_price_final)) +
geom_density() +
labs(y = "",
x = "Price of an AirBnB in Copenhagen") +
facet_grid(
neighbourhood_simplified ~ prop_type_simplified,
scales = "free"
) +
theme_classic() +
NULL

# Splitting data in training and testing sets
library(rsample)
set.seed(1234) # for reproducibility, and to always get the same split, set the random seed first
train_test_split <- initial_split(listings_log, prop = 0.80)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)
rmse_train <- listings_train %>%
mutate(predictions = predict(final_model, .)) %>%
summarise(sqrt(sum(predictions - price_4_nights_log,na.rm = TRUE)**2/n())) %>%
pull()
rmse_train
## [1] 0.01582266
rmse_test <- listings_test %>%
mutate(predictions = predict(final_model, .)) %>%
summarise(sqrt(sum(predictions - price_4_nights_log,na.rm = TRUE)**2/n())) %>%
pull()
rmse_test
## [1] 0.0316474
# Find mean and median of the fitted prices
model_predictions %>%
summarize(mean_price = mean(fitted_price_final, na.rm = TRUE),
median_price = median(fitted_price_final, na.rm = TRUE)
)
| mean_price | median_price |
|---|---|
| 1.98e+03 | 1.9e+03 |
As the price distribution is right skewed, choosing mean as measure to predict the final price wouldn’t be economical so we’ve decided to use the median because it fits our budget.
predicted_value <- model_predictions %>%
filter(fitted_price_final == median(fitted_price_final, na.rm = TRUE))
print(predicted_value\(lower_final) + print(predicted_value\)fitted_price_final) + print(predicted_value$upper_final)
======= # locate the median value property predicted_value <- model_predictions %>% filter(fitted_price_final == median(fitted_price_final, na.rm = TRUE)) filter(fitted_price_final <= median(fitted_price_final, na.rm = TRUE), fitted_price_final >= median(fitted_price_final, na.rm = TRUE)) >>>>>>> 0e04536b9df3b4269912549b1ce7e7acfc6f8905
predicted_value %>% select(listing_url, accommodates, price, cleaning_fee, extra_people, prop_type_simplified, room_type, review_scores_rating, bathrooms, bedrooms, beds, neighbourhood_cleansed, )
print the point estimates and their confidence intervals of these median properties
predicted_value[, 29:31]
```
Conclusion
To conclude, we can predict, with 95% confidence, that a median listing on AirBnB in Copenhagen, for 2 people and 4 nights, will cost between 1,860.18 DKK and 1,938.58 DKK.